Collective network routing

ABSTRACT

Disclosed are a unified method and apparatus to classify, route, and process data packets injected into a network so that they belong to a plurality of logical networks, each implementing a specific flow of data on top of a common physical network. The method allows collectives of packets to be identified locally for local processing, such as the computation of the sum, difference, maximum, minimum, or other logical operations among the identified packet collective. Packets are injected together with a class attribute and an opcode attribute. Network routers employing the described method use the packet attributes to look up the class-specific route information from a local route table, which contains the local incoming and outgoing directions as part of the specifically implemented global data flow of the particular virtual network.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of Provisional Application No. 60/625,026, for “Collective Network Routing,” filed Nov. 4, 2004.

GOVERNMENT CONTRACT

This invention was made with Government support under Subcontract B517552 under prime contract W-7405-ENG-48 awarded by The Department of Energy. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention generally relates to the field of high-speed digital data processing systems; and more specifically, the invention relates to methods and systems for routing messages in computer systems.

2. Background Art

Massively parallel computer systems comprise a large number of data processing elements, which are typically connected using a network. Each node connected to the said network typically is comprised of a network interface and the local data processing elements. The network interface receives data from the network, which is addressed to this particular node, and the network interface also injects the local results into the network. Data is typically routed through the network in packets; and the packets are routed by a plurality of routers, typically one router per node. The network, specifically the plurality of the network routers, ensures the movement of the injected packets between the connected nodes towards the desired packet destinations.

Typically, the node, which produces a data packet, specifies the desired destination of that packet by specifically providing a unique address of the said packet destination. Upon injection of such an attributed packet, the plurality of network routers make local routing decisions to incrementally reduce the distance of the packet to its destination by forwarding the packet to a connected node closer to the specified destination. This universal point-to-point style of communication is state of the art and used by most of today's implemented computer networks. The drawback of using addresses as part of the packet attributes is the limitation of the network scalability to the maximal number of addresses presentable with the bits dedicated to the packet address.

Furthermore, additional auxiliary networks have been used to implement special support for collective communication such as global broadcasts to all connected nodes (CM-5). These networks have typically the topology of a tree or a fat tree, since the tree topology provides the minimal distance between any two connected nodes and, thus, minimal communication latency.

There are several constraints imposed to particular nodes by the tree topology. For example, the dedicated root node splits the network into two domains, left and right. Traffic from one domain targeted to the other domain must go through the root node under any circumstances. A broken root-node, router and/or links, will render the entire network useless since no packets can be routed from the left to the right partition. In addition, leaf nodes in a tree network have only one connection to the network. If this link is broken, the entire network is also not functional anymore.

SUMMARY OF THE INVENTION

An object of this invention is to provide an improved method and system for routing data packets through multi-node computer networks.

Another object of the invention is to avoid using addresses to route data packets through computer networks by classifying the packets into a limited set of classes for which the packet behavior can be specified in detail on a per-node basis.

A further object of the present invention is to allow the nodes of a multi-node computer network to utilize an arbitrary number of links between nodes while still assuring deterministic packet routes through the network and thus allow for well-defined, well-behaving collective operations.

Another object of this invention is to route data packets through a multi-node computer network by employing a general address-less static routing method applicable to arbitrary network topologies, which allows a plurality of virtual logical networks to be embedded in one physical network.

A further object of the invention is to enable a multi-node computer system to define and process collective packet operations such as global packet reductions (global maximum or similar) on networks of arbitrary size and shape.

These and other objectives are attained with a method of and a system for routing data packets in a computer network having a multitude of nodes and a multitude of links connecting the nodes together, and wherein each data packet includes a class identifier. The method comprises the steps of, each node, for each of a defined set of data packets, checking or looking at the data packet to identify the class of the data packet, and routing the data packet from the node based on the identified class of the data packet.

The preferred embodiment of the invention, described in detail below, provides a method and apparatus for identifying collectives of packets among a plurality of packets on a network in a system of connected data processing elements which are connected using an arbitrary network topology. The preferred method yields local routing decisions as well as decisions whether collective packet reduction operations or other operations should be applied to the identified packet collective. The local result of the said collective packet operation is routed to the collective of connected nodes and/or locally received. The preferred embodiment of the invention allows the specification of packet data reductions among an arbitrary set of nodes, connected using an arbitrary interconnection topology. In addition to packet reductions, the invention also can be utilized to multi-/broadcast packets among one or more configurable sets of nodes in a network.

The preferred embodiment of the invention provides a number of important advantages. For instance, the invention avoids using addresses, and thus avoids their associated limitations, by classifying packets into a limited set of classes for which the packet behavior can be specified in detail on a per-node basis using, for example, local class descriptor tables. Each class may have a virtually unlimited set of nodes participating.

In addition, this preferred embodiment allows nodes to utilize an arbitrary number of links between nodes while still assuring deterministic packet routes through the network and, thus, allows for well-defined, well-behaving collective operations. Since the effective topology is defined per packet class, the topology may be modified dynamically, for example to compensate for broken links or to extend or shrink the affected network partition. Changes of the logical network topology may also be transparent to the application.

Further, the preferred embodiment of the invention disclosed herein solves several problems of auxiliary networks by employing a general address-less static routing method applicable to arbitrary network topologies, which allows a plurality of virtual logical networks to be embedded in one physical network, for example tree networks with redundant links or irregular networks. The absence of source or target addresses allows the application of this invention to networks of any size and shape.

In addition, the invention, in its preferred embodiment, allows collective packet operations such as global packet reductions (global sum or global maximum or similar) to be defined and processed on networks of arbitrary size and shape.

Further benefits and advantages of the invention will become apparent from a consideration of the following detailed description, given with reference to the accompanying drawings, which specify and show preferred embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the general architecture of a network comprising a plurality of nodes connected by an interconnection network of the degree n.

FIG. 2 describes the general structure of a single node of the network of FIG. 1.

FIG. 3 depicts a sparsely connected network topology, suitable for the invention described herein.

FIG. 4 shows a route descriptor table comprising n route descriptors describing the packet behavior for a node with four network links plus one local client.

FIG. 5 shows one exemplary route descriptor configuration for class 0 of the two nodes A and B of the network shown in FIG. 3.

FIG. 6 illustrates a data packet that may be used in the practice of this invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The herein described invention solves the problem of describing packet routes of single packets and of defining packet collectives for collective packet operations among a plurality of nodes connected by a network with arbitrary topology of degree n as shown at 10 in FIG. 1. Each node 12 itself comprises the network interface and the local client, which contains the processing elements for data processing of the received data and for injecting results of the local computation into the network.

The general structure of the network interface with four links, for a network of degree four, is also shown in FIG. 2. Each network link comprises a network receiver 14, which receives packets from the network link and presents the packets to the arbiter 16, which routes the packets, via sender 20, towards the targets specified using the collective class routing method herein described. The network interface also includes a local client CPU and memory, represented at 22.

In particular, the arbiter 16 first evaluates the packet header information, shown at 24 in FIG. 6, such as the class 26 of the packet and the specified packet opcode 30. Using the class information, the arbiter 16 retrieves the appropriate route descriptor from the route descriptor table, shown at 32 in FIG. 4. The route descriptor table can be read and written and contains the specific description 34 of the packet behavior for all available packet classes for this particular node. The route descriptor table may be different for each node in the network, depending on the position of the node in the logical network structure and on the availability of physical connections to neighboring nodes.

The route descriptor table 32 may be initialized immediately after booting the local client. During runtime, packet classes may be allocated and initialized to implement a specific communication pattern such as packet broadcast or a packet reduction. From that time on, packets, injected with the appropriate class-tag, follow the configured packet routes according to the route descriptors deposited in the nodes along the packet path through the network. For collective packet operations such as reductions, the route descriptor 34 also indicates which packets are members of the collective. The router 36 will wait until the packet collective is complete and it will then forward the packet collective while applying the specified packet operation.

The collective is considered to be complete if all channels which are identified as source-channels have collective packets available. Whether a packet is considered a collective packet or not is specified on a per-packet basis using the packet opcode field 30 of the packet header 24, shown in FIG. 6.

For the configuration shown in FIG. 5, for example, Node A considers a packet collective as complete if the receivers of channel 2, channel 1, and the local client signal the availability of a packet, which is marked with an opcode such as ADD or MAX, which in turn indicates a collective operation. In that case, the router applies the specified operation to the packet collective and routes the result of that operation to the sender of channel 0.
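
The completeness test and the reduction described for Node A can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed hardware: the names LOCAL, COLLECTIVE_OPS, collective_complete, and reduce_collective are hypothetical, packet payloads are modeled as plain integers, and all members of a collective are assumed to carry the same opcode.

```python
from functools import reduce

LOCAL = "local"  # hypothetical label for the local client port

# Hypothetical opcode table: opcodes that request a collective operation.
COLLECTIVE_OPS = {
    "ADD": lambda a, b: a + b,
    "MAX": max,
}

def collective_complete(source_channels, pending):
    """True when every source channel holds a packet with a collective opcode.

    `pending` maps a channel to an (opcode, payload) tuple; a channel with no
    waiting packet is simply absent from the mapping.
    """
    return all(
        ch in pending and pending[ch][0] in COLLECTIVE_OPS
        for ch in source_channels
    )

def reduce_collective(source_channels, pending):
    """Combine the payloads of a complete collective into a single result."""
    opcodes = {pending[ch][0] for ch in source_channels}
    if len(opcodes) != 1:
        raise ValueError("sketch assumes all members carry the same opcode")
    op = COLLECTIVE_OPS[opcodes.pop()]
    return reduce(op, (pending[ch][1] for ch in source_channels))

# Node A, class 0, as described in the text: sources are channel 1,
# channel 2, and the local client; the result goes to channel 0.
node_a_sources = [1, 2, LOCAL]
node_a_targets = [0]

pending = {1: ("ADD", 5), 2: ("ADD", 7)}
waiting = collective_complete(node_a_sources, pending)  # local client not ready
pending[LOCAL] = ("ADD", 3)
ready = collective_complete(node_a_sources, pending)
result = reduce_collective(node_a_sources, pending)     # forwarded to channel 0
```

In this sketch the router would hold the packets on channels 1 and 2 until the local client contributes its packet, and only then emit the single combined packet on the target channel.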

Packets, not marked as operands of collective packet operations, are simply routed to the target channels without waiting for the other sources. In general the rule for forwarding packets is: if the packet enters the node from a channel which is marked as a source channel for the given packet class, then the packet or the packet collective is routed to the specified target channels of that class. If the channel is not an explicit source channel, then the packet is routed to all the source channels and no collective operation is applied even though the packet opcode may request a collective operation.
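
The forwarding rule stated above can be expressed as a small decision function. The sketch below is illustrative only: RouteDescriptor and forward are hypothetical names, channels are modeled as set members, and the example descriptor reuses the Node A configuration described for FIG. 5.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouteDescriptor:
    """Hypothetical per-class descriptor: source and target channel sets."""
    sources: frozenset
    targets: frozenset

def forward(descriptor, incoming, opcode_is_collective):
    """Return (output_channels, apply_collective) for a packet on `incoming`."""
    if incoming in descriptor.sources:
        # Arrived on a source channel: route to the targets of the class.
        # A collective operation applies only if the opcode requests one.
        return descriptor.targets, opcode_is_collective
    # Arrived on a non-source channel: route to all source channels and
    # never apply a collective operation, whatever the opcode requests.
    return descriptor.sources, False

# Node A, class 0: sources are channels 1, 2 and the local client; target is 0.
node_a = RouteDescriptor(sources=frozenset({1, 2, "local"}),
                         targets=frozenset({0}))

up = forward(node_a, 1, True)    # from a source channel
down = forward(node_a, 0, True)  # from a non-source channel
```

With a descriptor of this shape, the same class carries a reduction in one direction (sources to target) and distributes a packet arriving on the target channel back to all sources in the other direction.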

The apparatus enabling the method of collective class routing, the route descriptor table 32, is an array of registers, which can be read and written, comprising at least two bits per potential packet route (channel). One bit indicates whether the particular channel is a dedicated source channel for that packet class. The other bit is set for all channels which are dedicated targets for packets of the given class. Thus the descriptor describes the packet routes and the packet collective for collective operations.
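
One way to picture such a register is as a bit field with a source bit and a target bit per channel. The layout below is an assumption made for illustration (source bit at position 2*ch, target bit at position 2*ch + 1, local client at index 4, 16 classes); the text only requires at least two bits per channel and does not fix a layout.

```python
NUM_CHANNELS = 5   # four network links plus one local client, as in FIG. 4
LOCAL = 4          # hypothetical channel index chosen for the local client
NUM_CLASSES = 16   # assumed table size; not specified in the text

def encode_descriptor(sources, targets):
    """Pack source/target channel sets into one register value (assumed layout)."""
    reg = 0
    for ch in sources:
        reg |= 1 << (2 * ch)        # source bit of channel ch
    for ch in targets:
        reg |= 1 << (2 * ch + 1)    # target bit of channel ch
    return reg

def decode_descriptor(reg):
    """Unpack a register value back into (sources, targets) channel sets."""
    sources = {ch for ch in range(NUM_CHANNELS) if (reg >> (2 * ch)) & 1}
    targets = {ch for ch in range(NUM_CHANNELS) if (reg >> (2 * ch + 1)) & 1}
    return sources, targets

# Route descriptor table: one readable and writable register per packet class.
route_table = [0] * NUM_CLASSES
route_table[0] = encode_descriptor({1, 2, LOCAL}, {0})  # Node A, class 0
```

Under this assumed layout the class 0 register of Node A holds the value 0b100010110, and decoding it recovers the source set {1, 2, 4} and the target set {0}.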

With reference to FIG. 6, each packet header comprises the packet class 26 and the packet opcode 30. The class is used to identify the appropriate packet routes, and the opcode is used to decide whether collective operations should be applied. FIG. 5 shows an example of a particular configuration for class 0 of the two nodes A and B of the network shown in FIG. 3. It should be noted that a number of configurations may coexist in parallel, each using a different class with different sources and targets specified.
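
The division of labor between class and opcode, and the coexistence of several configurations on one node, can be sketched as a lookup keyed by class. Everything named here is hypothetical (Header, COLLECTIVE_OPCODES, table, lookup, the "NONE" opcode); the class 0 entry follows the Node A description for FIG. 5, while the class 1 entry is invented purely to show a second configuration and does not appear in the figures.

```python
from typing import NamedTuple

class Header(NamedTuple):
    packet_class: int  # selects the route descriptor
    opcode: str        # decides whether a collective operation applies

# ADD and MAX are the opcodes named in the text; the set name is hypothetical.
COLLECTIVE_OPCODES = {"ADD", "MAX"}

# Hypothetical per-node table: class -> (source channels, target channels).
table = {
    0: ({1, 2, "local"}, {0}),   # class 0 of Node A as described for FIG. 5
    1: ({"local"}, {0, 1, 2}),   # invented class 1: local client sends to links
}

def lookup(header):
    """Return (sources, targets, is_collective) for a packet header."""
    sources, targets = table[header.packet_class]
    return sources, targets, header.opcode in COLLECTIVE_OPCODES

reduction = lookup(Header(packet_class=0, opcode="ADD"))
broadcast = lookup(Header(packet_class=1, opcode="NONE"))
```

Both configurations share the same physical channels; only the class carried in the header determines which set of sources and targets is in effect for a given packet.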

While it is apparent that the invention herein disclosed is well calculated to fulfill the objects stated above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.

1. A method of routing data packets in a computer network having a multitude of nodes and a multitude of links connecting the nodes together, each data packet including a class identifier, the method comprising: each node, for each of a defined set of data packets, looking at the data packet to identify the class of the data packet; and routing the data packet from the node based on the identified class of the data packet.
2. A method according to claim 1, wherein each node includes a route descriptor table including one or more route descriptors, each route descriptor (i) specifying a route from the node, and (ii) being associated with one of a plurality of packet classes, and wherein the routing step includes the steps of, each node, for each of the defined set of data packets, identifying the route descriptor, in the route descriptor table of the node, associated with the class of the data packet; and routing the data packet on the route specified by the identified route descriptor.
3. A method according to claim 2, wherein each node includes one or more channels, and each route descriptor table includes an array of registers comprising at least two bits for each channel of the node in which the descriptor table is located.
4. A method according to claim 1, wherein at each node, the defined set of data packets includes data packets received at the node and data packets originated at the node.
5. A method as specified in claim 1, wherein an opcode associated with each packet or packet class indicates that the plurality of selected, specified input directions defines a local collective of packets which is subject to the said collective packet operation.
6. A method as specified in claim 5, wherein the identified collective of packets is subject to arithmetical/logical data processing operations to combine the members of the packet collective thereby reducing the number of packets to be routed through the network.
7. A method as specified in claim 5, wherein the identified collective of packets is subject to arithmetical/logical data processing operations to compute a number of resulting packets based on the data of the identified packet collective without reducing the number of packets to be routed through the network.
8. An apparatus for routing data packets in a computer network having a multitude of nodes and a multitude of links connecting the nodes together, each data packet including a class identifier, the apparatus comprising: a plurality of checking means, each of the checking means being located at a respective one of the nodes for checking each of a defined set of data packets, to identify the class of the data packet; and a plurality of routing means, each of the routing means being located at a respective one of the nodes to route data packets from the node based on the class of the data packets.
9. Apparatus according to claim 8, further comprising a plurality of route descriptor tables, each of the route descriptor tables being located at a respective one of the nodes, each route descriptor table including one or more route descriptors, each route descriptor (i) specifying a route in the network and (ii) being associated with one of a plurality of packet classes, and wherein each routing means includes: means for identifying the route descriptor, in the route descriptor table at the node at which the routing means is located, associated with the identified class of the data packet; and means for directing the data packet onto the route specified by the identified route descriptor.
10. Apparatus according to claim 9, wherein each node includes one or more channels, and each route descriptor table includes an array of registers comprising at least two bits for each channel of the node in which the descriptor table is located.
11. Apparatus according to claim 8, wherein, at each node, the defined set of data packets includes data packets received at the node and data packets originated at the node.
12. A method as specified in claim 8, wherein an opcode associated with each packet or packet class indicates that the plurality of selected, specified input directions defines a local collective of packets which is subject to the said collective packet operation.
13. A method as specified in claim 8, wherein in addition to packet routes also specifies a plurality of nodes on the network which participate in a particular collective operation, comprising two additional bits per local route descriptor; one bit corresponding to the local contribution of the actual node to the said collective operation and another bit correlating to the local reception of the results of the said operation.
14. The method specified in claim 13, wherein the packet class and opcode information is also subject to a collective operation such as but not limited to substituting the class or opcode with certain predefined values or increment/decrement operations.
15. The method specified in claim 14, wherein the packet class and/or opcode information is being utilized to apply pre-processing steps, such as reverting the word order, to the data locally injected and/or to apply post-processing steps to the said data.
16. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for routing data packets in a computer network having a multitude of nodes and a multitude of links connecting the nodes together, each data packet including a class identifier, said method steps comprising: each node, for each of a defined set of data packets, looking at the data packet to identify the class of the data packet; and routing the data packet from the node based on the identified class of the data packet.
17. A program storage device according to claim 16, wherein each node includes a route descriptor table including one or more route descriptors, each route descriptor (i) specifying a route from the node, and (ii) being associated with one of a plurality of packet classes, and wherein the routing step includes the steps of, each node, for each of the defined set of data packets, identifying the route descriptor, in the route descriptor table of the node, associated with the class of the data packet; and routing the data packet on the route specified by the identified route descriptor.
18. A program storage device according to claim 17, wherein each node includes one or more channels, and each route descriptor table includes an array of registers comprising at least two bits for each channel of the node in which the descriptor table is located.
19. A program storage device according to claim 16, wherein at each node, the defined set of data packets includes data packets received at the node and data packets originated at the node.
20. A program storage device according to claim 16, wherein an opcode associated with each packet or packet class indicates that the plurality of selected, specified input directions defines a local collective of packets which is subject to the said collective packet operation.
21. A program storage device according to claim 20, wherein the identified collective of packets is subject to arithmetical/logical data processing operations to combine the members of the packet collective thereby reducing the number of packets to be routed through the network.
22. A method of identifying a collective of data packets on a computer system, the computer system including a multitude of interconnected processing nodes, and wherein a multitude of data packets are routed in the computer system, the method comprising the steps of: allocating a class identifier to identify a given class of data packets; providing each data packet in said given class with said class identifier; providing each of the nodes with a set of channels for receiving and holding data packets; and each of at least some of the nodes, i) identifying a subset of the set of channels of the node, ii) evaluating the data packets at the node to identify the class of the data packet, and iii) identifying said collective as complete when all of the channels of said subset have a data packet of the given class.
23. A method according to claim 22, wherein each of said at least some of the nodes includes a route descriptor table identifying routes for data packets from said node, and further comprising the step of using the route descriptor table to identify a route for the collective of data packets from the node.
24. A method according to claim 22, further comprising the step of initializing the plurality of local route descriptors to reflect the local flow of data for the desired global operation associated with that class by selectively enabling the desired input and output directions.
25. A method according to claim 24, further comprising the steps of: using the class identifier of the incoming packets to select one of the local route descriptors; and using the selected route descriptor to determine the plurality of valid input and output directions.
26. A method according to claim 25, further comprising the step of comparing the incoming direction of the packet with the specified incoming directions of the packet class to determine whether the packet is to be routed to the specified output directions or to the specified input directions.
27. A method as specified in claim 26, wherein an opcode associated with each packet or packet class indicates that the plurality of selected, specified input directions defines a local collective of packets which is subject to the said collective packet operation.
28. A method as specified in claim 27, wherein the identified collective of packets is subject to arithmetical/logical data processing operations to combine the members of the packet collective thereby reducing the number of packets to be routed through the network.